Overview

Dataset statistics

Number of variables: 17
Number of observations: 1776633
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 230.4 MiB
Average record size in memory: 136.0 B
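
The memory figures above can be cross-checked with a little arithmetic. The sketch below assumes that every column stores one 8-byte value per row (for example int64, float64, datetime64, or an 8-byte object pointer); that assumption is not stated in the report, but it is consistent with the 13.6 MiB reported for each variable.

```python
# Figures copied from the "Dataset statistics" block of this report.
n_variables = 17
n_observations = 1_776_633

# Assumption (not stated in the report): each column holds one 8-byte
# value per row, i.e. shallow memory usage.
BYTES_PER_VALUE = 8

record_bytes = n_variables * BYTES_PER_VALUE
column_mib = n_observations * BYTES_PER_VALUE / 2**20
total_mib = n_observations * record_bytes / 2**20

print(f"record size: {record_bytes} B")
print(f"per column:  {column_mib:.1f} MiB")
print(f"whole table: {total_mib:.1f} MiB")
```

Under that assumption the three results round to the reported 136.0 B, 13.6 MiB and 230.4 MiB.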

Variable types

NUM: 12
CAT: 4
DATE: 1

Reproduction

Analysis started: 2020-11-21 18:56:34.831341
Analysis finished: 2020-11-21 18:59:51.900866
Duration: 3 minutes and 17.07 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Time has a high cardinality: 1439 distinct values (High cardinality)
df_index has unique values (Unique)
Accident_Index has unique values (Unique)

Variables

df_index
Real number (ℝ≥0)

UNIQUE

Distinct count: 1776633
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 890240.0305319106
Minimum: 0
Maximum: 1780651
Zeros: 1
Zeros (%): < 0.1%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 0
5-th percentile: 88983.6
Q1: 445082
median: 890219
Q3: 1335387
95-th percentile: 1691455.4
Maximum: 1780651
Range: 1780651
Interquartile range (IQR): 890305

Descriptive statistics

Standard deviation: 513997.7269
Coefficient of variation (CV): 0.5773698208
Kurtosis: -1.199892811
Mean: 890240.0305
Median Absolute Deviation (MAD): 445153
Skewness: 0.0001794270646
Sum: 1.581629816e+12
Variance: 2.641936633e+11

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
2047                    1         < 0.1%
1573290                 1         < 0.1%
1679774                 1         < 0.1%
1681823                 1         < 0.1%
1593760                 1         < 0.1%
1595809                 1         < 0.1%
1591715                 1         < 0.1%
1601956                 1         < 0.1%
1604005                 1         < 0.1%
1597862                 1         < 0.1%
Other values (1776623)  1776623   > 99.9%

Minimum 5 values

Value                   Count     Frequency (%)
0                       1         < 0.1%
1                       1         < 0.1%
2                       1         < 0.1%
3                       1         < 0.1%
4                       1         < 0.1%

Maximum 5 values

Value                   Count     Frequency (%)
1780651                 1         < 0.1%
1780650                 1         < 0.1%
1780649                 1         < 0.1%
1780648                 1         < 0.1%
1780647                 1         < 0.1%

Accident_Index
Categorical

UNIQUE

Distinct count: 1776633
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 13.6 MiB

Common values

Value                   Count     Frequency (%)
200701FH10195           1         < 0.1%
201401CW11492           1         < 0.1%
201297UD03011           1         < 0.1%
201230C000077           1         < 0.1%
2005070502592           1         < 0.1%
2012160D05521           1         < 0.1%
2013520401026           1         < 0.1%
200501ZD30252           1         < 0.1%
2008930000879           1         < 0.1%
2014450013162           1         < 0.1%
Other values (1776623)  1776623   > 99.9%

Length

Max length: 13
Median length: 13
Mean length: 13
Min length: 13

Longitude
Real number (ℝ)

Distinct count: 1243949
Unique (%): 70.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: -1.4285615522063362
Minimum: -7.516225
Maximum: 1.7620099999999999
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: -7.516225
5-th percentile: -4.0589504
Q1: -2.355523
median: -1.3879
Q3: -0.215937
95-th percentile: 0.5598604
Maximum: 1.76201
Range: 9.278235
Interquartile range (IQR): 2.139586

Descriptive statistics

Standard deviation: 1.403892363
Coefficient of variation (CV): -0.9827314484
Kurtosis: -0.357684417
Mean: -1.428561552
Median Absolute Deviation (MAD): 1.108824
Skewness: -0.374797897
Sum: -2538029.596
Variance: 1.970913768

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
-0.977611               68        < 0.1%
-1.871043               57        < 0.1%
-3.310596               48        < 0.1%
-0.104426               46        < 0.1%
-0.173445               45        < 0.1%
-1.999967               44        < 0.1%
-1.234393               43        < 0.1%
-3.241694               43        < 0.1%
-1.216694               42        < 0.1%
-0.816789               42        < 0.1%
Other values (1243939)  1776155   > 99.9%

Minimum 5 values

Value                   Count     Frequency (%)
-7.516225               1         < 0.1%
-7.515933               1         < 0.1%
-7.509162               1         < 0.1%
-7.507468               1         < 0.1%
-7.507207               1         < 0.1%

Maximum 5 values

Value                   Count     Frequency (%)
1.76201                 1         < 0.1%
1.759398                1         < 0.1%
1.759382                2         < 0.1%
1.758797                1         < 0.1%
1.758722                1         < 0.1%

Latitude
Real number (ℝ≥0)

Distinct count: 1166816
Unique (%): 65.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 52.57375245386696
Minimum: 49.912941
Maximum: 60.757543999999996
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 49.912941
5-th percentile: 50.8234186
Q1: 51.487541
median: 52.268092
Q3: 53.464518
95-th percentile: 55.8368152
Maximum: 60.757544
Range: 10.844603
Interquartile range (IQR): 1.976977

Descriptive statistics

Standard deviation: 1.451980973
Coefficient of variation (CV): 0.02761798245
Kurtosis: 0.8039279664
Mean: 52.57375245
Median Absolute Deviation (MAD): 0.880182
Skewness: 1.018398657
Sum: 93404263.54
Variance: 2.108248745

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
52.458798               74        < 0.1%
52.949719               68        < 0.1%
51.519764               49        < 0.1%
51.506693               48        < 0.1%
51.526956               45        < 0.1%
52.470689               44        < 0.1%
52.989857               43        < 0.1%
51.482076               42        < 0.1%
52.47217                42        < 0.1%
54.96844                42        < 0.1%
Other values (1166806)  1776136   > 99.9%

Minimum 5 values

Value                   Count     Frequency (%)
49.912941               1         < 0.1%
49.913077               1         < 0.1%
49.914145               1         < 0.1%
49.91443                1         < 0.1%
49.914488               1         < 0.1%

Maximum 5 values

Value                   Count     Frequency (%)
60.757544               1         < 0.1%
60.724682               1         < 0.1%
60.714774               1         < 0.1%
60.714772               1         < 0.1%
60.668921               1         < 0.1%

Police_Force
Real number (ℝ≥0)

Distinct count: 51
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 30.745305867897308
Minimum: 1
Maximum: 98
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 7
median: 31
Q3: 46
95-th percentile: 94
Maximum: 98
Range: 97
Interquartile range (IQR): 39

Descriptive statistics

Standard deviation: 25.52641903
Coefficient of variation (CV): 0.8302541903
Kurtosis: 0.3125859326
Mean: 30.74530587
Median Absolute Deviation (MAD): 19
Skewness: 0.8381922888
Sum: 54623125
Variance: 651.5980684

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
1                       264636    14.9%
20                      74447     4.2%
43                      65998     3.7%
13                      65749     3.7%
6                       64754     3.6%
46                      56080     3.2%
44                      54343     3.1%
50                      50939     2.9%
4                       49317     2.8%
97                      49296     2.8%
Other values (41)       981074    55.2%

Minimum 5 values

Value                   Count     Frequency (%)
1                       264636    14.9%
3                       16079     0.9%
4                       49317     2.8%
5                       37150     2.1%
6                       64754     3.6%

Maximum 5 values

Value                   Count     Frequency (%)
98                      4087      0.2%
97                      49296     2.8%
96                      6382      0.4%
95                      25869     1.5%
94                      5820      0.3%

Accident_Severity
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 13.6 MiB

Common values

Value                   Count     Frequency (%)
3                       1512341   85.1%
2                       241436    13.6%
1                       22856     1.3%

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Number_of_Vehicles
Real number (ℝ≥0)

Distinct count: 8
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.8300661982525372
Minimum: 1
Maximum: 8
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 2
Q3: 2
95-th percentile: 3
Maximum: 8
Range: 7
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 0.700601846
Coefficient of variation (CV): 0.3828286904
Kurtosis: 5.130288277
Mean: 1.830066198
Median Absolute Deviation (MAD): 0
Skewness: 1.262738655
Sum: 3251356
Variance: 0.4908429466

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
2                       1056778   59.5%
1                       537984    30.3%
3                       141915    8.0%
4                       30182     1.7%
5                       6687      0.4%
6                       2039      0.1%
7                       710       < 0.1%
8                       338       < 0.1%

Minimum 5 values

Value                   Count     Frequency (%)
1                       537984    30.3%
2                       1056778   59.5%
3                       141915    8.0%
4                       30182     1.7%
5                       6687      0.4%

Maximum 5 values

Value                   Count     Frequency (%)
8                       338       < 0.1%
7                       710       < 0.1%
6                       2039      0.1%
5                       6687      0.4%
4                       30182     1.7%

Number_of_Casualties
Real number (ℝ≥0)

Distinct count: 8
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.3435718012667783
Minimum: 1
Maximum: 8
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 1
95-th percentile: 3
Maximum: 8
Range: 7
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.7560019664
Coefficient of variation (CV): 0.5626807333
Kurtosis: 11.74557609
Mean: 1.343571801
Median Absolute Deviation (MAD): 0
Skewness: 3.00196662
Sum: 2387034
Variance: 0.5715389731

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
1                       1364911   76.8%
2                       284539    16.0%
3                       81106     4.6%
4                       29249     1.6%
5                       10821     0.6%
6                       4019      0.2%
7                       1392      0.1%
8                       596       < 0.1%

Minimum 5 values

Value                   Count     Frequency (%)
1                       1364911   76.8%
2                       284539    16.0%
3                       81106     4.6%
4                       29249     1.6%
5                       10821     0.6%

Maximum 5 values

Value                   Count     Frequency (%)
8                       596       < 0.1%
7                       1392      0.1%
6                       4019      0.2%
5                       10821     0.6%
4                       29249     1.6%

Date
Date

Distinct count: 4017
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 13.6 MiB
Minimum: 2005-01-01 00:00:00
Maximum: 2015-12-31 00:00:00

Histogram

Day_of_Week
Real number (ℝ≥0)

Distinct count: 7
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.115380610401811
Minimum: 1
Maximum: 7
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 4
Q3: 6
95-th percentile: 7
Maximum: 7
Range: 6
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 1.923748326
Coefficient of variation (CV): 0.4674533191
Kurtosis: -1.187251336
Mean: 4.11538061
Median Absolute Deviation (MAD): 2
Skewness: -0.06485006723
Sum: 7311521
Variance: 3.70080762

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
6                       290694    16.4%
4                       267780    15.1%
5                       266912    15.0%
3                       266112    15.0%
2                       252678    14.2%
7                       237588    13.4%
1                       194869    11.0%

Minimum 5 values

Value                   Count     Frequency (%)
1                       194869    11.0%
2                       252678    14.2%
3                       266112    15.0%
4                       267780    15.1%
5                       266912    15.0%

Maximum 5 values

Value                   Count     Frequency (%)
7                       237588    13.4%
6                       290694    16.4%
5                       266912    15.0%
4                       267780    15.1%
3                       266112    15.0%

Time
Categorical

HIGH CARDINALITY

Distinct count: 1439
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 13.6 MiB

Common values

Value                   Count     Frequency (%)
17:00                   17309     1.0%
17:30                   16496     0.9%
16:00                   15852     0.9%
18:00                   15682     0.9%
15:30                   15526     0.9%
16:30                   15044     0.8%
15:00                   13910     0.8%
08:30                   13885     0.8%
13:00                   12580     0.7%
18:30                   12554     0.7%
Other values (1429)     1627795   91.6%

Length

Max length: 5
Median length: 5
Mean length: 5
Min length: 5

Road_Type
Real number (ℝ≥0)

Distinct count: 6
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5.167390226343876
Minimum: 1
Maximum: 9
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 6
median: 6
Q3: 6
95-th percentile: 6
Maximum: 9
Range: 8
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.644681763
Coefficient of variation (CV): 0.3182809292
Kurtosis: 0.7315078305
Mean: 5.167390226
Median Absolute Deviation (MAD): 0
Skewness: -1.390747205
Sum: 9180556
Variance: 2.704978101

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
6                       1329635   74.8%
3                       262182    14.8%
1                       119167    6.7%
2                       36656     2.1%
7                       18608     1.0%
9                       10385     0.6%

Minimum 5 values

Value                   Count     Frequency (%)
1                       119167    6.7%
2                       36656     2.1%
3                       262182    14.8%
6                       1329635   74.8%
7                       18608     1.0%

Maximum 5 values

Value                   Count     Frequency (%)
9                       10385     0.6%
7                       18608     1.0%
6                       1329635   74.8%
3                       262182    14.8%
2                       36656     2.1%

Speed_limit
Real number (ℝ≥0)

Distinct count: 9
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 39.02097957203316
Minimum: 0
Maximum: 70
Zeros: 1
Zeros (%): < 0.1%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 0
5-th percentile: 30
Q1: 30
median: 30
Q3: 50
95-th percentile: 70
Maximum: 70
Range: 70
Interquartile range (IQR): 20

Descriptive statistics

Standard deviation: 14.15344896
Coefficient of variation (CV): 0.3627138303
Kurtosis: -0.4062700551
Mean: 39.02097957
Median Absolute Deviation (MAD): 0
Skewness: 1.098379489
Sum: 69325960
Variance: 200.3201175

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
30                      1139301   64.1%
60                      281623    15.9%
40                      146003    8.2%
70                      129278    7.3%
50                      58390     3.3%
20                      22002     1.2%
10                      19        < 0.1%
15                      16        < 0.1%
0                       1         < 0.1%

Minimum 5 values

Value                   Count     Frequency (%)
0                       1         < 0.1%
10                      19        < 0.1%
15                      16        < 0.1%
20                      22002     1.2%
30                      1139301   64.1%

Maximum 5 values

Value                   Count     Frequency (%)
70                      129278    7.3%
60                      281623    15.9%
50                      58390     3.3%
40                      146003    8.2%
30                      1139301   64.1%

Light_Conditions
Real number (ℝ≥0)

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.9502750427353315
Minimum: 1
Maximum: 7
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 6
Maximum: 7
Range: 6
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 1.647988079
Coefficient of variation (CV): 0.8450029061
Kurtosis: 0.5814640214
Mean: 1.950275043
Median Absolute Deviation (MAD): 0
Skewness: 1.40133637
Sum: 3464923
Variance: 2.715864708

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
1                       1301556   73.3%
4                       349036    19.6%
6                       98728     5.6%
7                       19145     1.1%
5                       8168      0.5%

Minimum 5 values

Value                   Count     Frequency (%)
1                       1301556   73.3%
4                       349036    19.6%
5                       8168      0.5%
6                       98728     5.6%
7                       19145     1.1%

Maximum 5 values

Value                   Count     Frequency (%)
7                       19145     1.1%
6                       98728     5.6%
5                       8168      0.5%
4                       349036    19.6%
1                       1301556   73.3%

Weather_Conditions
Real number (ℝ≥0)

Distinct count: 9
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.5680875003447532
Minimum: 1
Maximum: 9
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 1
95-th percentile: 5
Maximum: 9
Range: 8
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.625423403
Coefficient of variation (CV): 1.036564224
Kurtosis: 11.45094836
Mean: 1.5680875
Median Absolute Deviation (MAD): 0
Skewness: 3.481336994
Sum: 2785916
Variance: 2.642001238

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
1                       1421543   80.0%
2                       210300    11.8%
8                       39045     2.2%
9                       32281     1.8%
5                       25826     1.5%
4                       23284     1.3%
3                       12390     0.7%
7                       9664      0.5%
6                       2300      0.1%

Minimum 5 values

Value                   Count     Frequency (%)
1                       1421543   80.0%
2                       210300    11.8%
3                       12390     0.7%
4                       23284     1.3%
5                       25826     1.5%

Maximum 5 values

Value                   Count     Frequency (%)
9                       32281     1.8%
8                       39045     2.2%
7                       9664      0.5%
6                       2300      0.1%
5                       25826     1.5%

Road_Surface_Conditions
Real number (ℝ≥0)

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.3615670765993877
Minimum: 1
Maximum: 5
Zeros: 0
Zeros (%): 0.0%
Memory size: 13.6 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 2
95-th percentile: 2
Maximum: 5
Range: 4
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 0.6184766256
Coefficient of variation (CV): 0.4542388225
Kurtosis: 6.156321541
Mean: 1.361567077
Median Absolute Deviation (MAD): 0
Skewness: 2.158361787
Sum: 2419005
Variance: 0.3825133364

Histogram with fixed size bins (bins=10)

Common values

Value                   Count     Frequency (%)
1                       1225326   69.0%
2                       501335    28.2%
4                       35935     2.0%
3                       11458     0.6%
5                       2579      0.1%

Minimum 5 values

Value                   Count     Frequency (%)
1                       1225326   69.0%
2                       501335    28.2%
3                       11458     0.6%
4                       35935     2.0%
5                       2579      0.1%

Maximum 5 values

Value                   Count     Frequency (%)
5                       2579      0.1%
4                       35935     2.0%
3                       11458     0.6%
2                       501335    28.2%
1                       1225326   69.0%

Urban_or_Rural_Area
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 13.6 MiB

Common values

Value                   Count     Frequency (%)
1                       1144275   64.4%
2                       632322    35.6%
3                       36        < 0.1%

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) measures the linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
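
The calculation described above can be sketched in plain Python. This is an illustrative implementation, not the code the report generator runs; the function name `pearson_r` and the toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of products are sufficient.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: y is an exact linear function of x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
shifted_scaled = [3 * v + 7 for v in y]  # change of location and scale
negated = [-v for v in y]

print(pearson_r(x, y))
print(pearson_r(x, shifted_scaled))
print(pearson_r(x, negated))
```

The first two calls give the same value, which illustrates the invariance under location and scale changes; negating one variable flips the sign.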

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures the monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
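
As a minimal sketch of that recipe, the code below ranks both variables (ties receive the average of their positions, which is an assumption about tie handling) and then applies the Pearson formula to the ranks. The helper names and the toy data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman_rho(x, y):
    """Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical toy data: monotonic but clearly nonlinear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(spearman_rho(x, y))  # perfect monotonic association
print(pearson_r(x, y))     # smaller, because the relation is not linear
```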

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
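
The pair-counting definition above translates directly into code. The sketch below implements that definition literally (often called tau-a); statistical libraries commonly apply a tie-adjusted variant instead, so results may differ when the data contain ties. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; it counts for neither group.
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical toy data: 6 pairs in total, of which only the pair
# (2, 3) / (3, 2) is discordant.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

print(kendall_tau_a(x, y))  # (5 - 1) / 6
```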

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available in the Phik project documentation.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
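
For illustration, the bias-corrected coefficient can be sketched as follows, assuming the usual form of Bergsma's correction: φ² = χ²/n is reduced by (k-1)(r-1)/(n-1), and the table dimensions r and k are shrunk by (r-1)²/(n-1) and (k-1)²/(n-1). The function name and the contingency tables are hypothetical, and the chi-squared statistic is computed without a continuity correction, which may differ from what the report generator does.

```python
from math import sqrt


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table given as rows of counts."""
    n = sum(sum(row) for row in table)
    r, k = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(k)]

    # Pearson chi-squared statistic, no continuity correction (assumption).
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    k_corr = k - (k - 1) ** 2 / (n - 1)
    r_corr = r - (r - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


# Hypothetical 2x2 tables of counts.
associated = [[10, 0], [0, 10]]   # each row determines the column
independent = [[5, 5], [5, 5]]    # rows and columns unrelated

print(cramers_v_corrected(associated))   # close to 1
print(cramers_v_corrected(independent))  # 0.0
```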

Missing values

Sample

First rows

(row), df_index, Accident_Index, Longitude, Latitude, Police_Force, Accident_Severity, Number_of_Vehicles, Number_of_Casualties, Date, Day_of_Week, Time, Road_Type, Speed_limit, Light_Conditions, Weather_Conditions, Road_Surface_Conditions, Urban_or_Rural_Area
0, 0, 200501BS00001, -0.191170, 51.489096, 1, 2, 1, 1, 2005-01-04, 3, 17:42, 6, 30, 1, 2, 2, 1
1, 1, 200501BS00002, -0.211708, 51.520075, 1, 3, 1, 1, 2005-01-05, 4, 17:36, 3, 30, 4, 1, 1, 1
2, 2, 200501BS00003, -0.206458, 51.525301, 1, 3, 2, 1, 2005-01-06, 5, 00:15, 6, 30, 4, 1, 1, 1
3, 3, 200501BS00004, -0.173862, 51.482442, 1, 3, 1, 1, 2005-01-07, 6, 10:35, 6, 30, 1, 1, 1, 1
4, 4, 200501BS00005, -0.156618, 51.495752, 1, 3, 1, 1, 2005-01-10, 2, 21:13, 6, 30, 7, 1, 2, 1
5, 5, 200501BS00006, -0.203238, 51.515540, 1, 3, 2, 1, 2005-01-11, 3, 12:40, 6, 30, 1, 2, 2, 1
6, 6, 200501BS00007, -0.211277, 51.512695, 1, 3, 2, 1, 2005-01-13, 5, 20:40, 6, 30, 4, 1, 1, 1
7, 7, 200501BS00009, -0.187623, 51.502260, 1, 3, 1, 2, 2005-01-14, 6, 17:35, 3, 30, 1, 1, 1, 1
8, 8, 200501BS00010, -0.167342, 51.483420, 1, 3, 2, 2, 2005-01-15, 7, 22:43, 6, 30, 4, 1, 1, 1
9, 9, 200501BS00011, -0.206531, 51.512443, 1, 3, 2, 5, 2005-01-15, 7, 16:00, 6, 30, 1, 1, 1, 1

Last rows

(row), df_index, Accident_Index, Longitude, Latitude, Police_Force, Accident_Severity, Number_of_Vehicles, Number_of_Casualties, Date, Day_of_Week, Time, Road_Type, Speed_limit, Light_Conditions, Weather_Conditions, Road_Surface_Conditions, Urban_or_Rural_Area
1776623, 1780642, 2015984134815, -3.415353, 55.257772, 98, 3, 1, 1, 2015-10-28, 4, 19:00, 6, 60, 6, 1, 1, 2
1776624, 1780643, 2015984135815, -2.958098, 55.077953, 98, 3, 2, 1, 2015-11-20, 6, 16:55, 6, 60, 6, 2, 2, 2
1776625, 1780644, 2015984136815, -3.177818, 54.985933, 98, 3, 1, 2, 2015-11-24, 3, 20:03, 6, 30, 4, 1, 2, 2
1776626, 1780645, 2015984137515, -3.136722, 54.992202, 98, 3, 2, 1, 2015-12-01, 3, 17:15, 6, 60, 6, 4, 2, 2
1776627, 1780646, 2015984137615, -3.262676, 54.987365, 98, 3, 2, 1, 2015-12-02, 4, 16:30, 6, 30, 4, 2, 2, 2
1776628, 1780647, 2015984139015, -3.499388, 55.106659, 98, 3, 1, 1, 2015-12-13, 1, 02:30, 6, 60, 6, 7, 4, 2
1776629, 1780648, 2015984139115, -3.376671, 55.023855, 98, 3, 3, 1, 2015-12-11, 6, 13:24, 6, 60, 1, 1, 2, 2
1776630, 1780649, 2015984139715, -3.242159, 55.016316, 98, 3, 2, 1, 2015-12-02, 4, 13:50, 6, 60, 1, 1, 2, 2
1776631, 1780650, 2015984140215, -3.387067, 55.163502, 98, 2, 1, 4, 2015-12-23, 4, 00:01, 3, 70, 6, 4, 2, 2
1776632, 1780651, 2015984140515, -3.123385, 55.020580, 98, 3, 3, 3, 2015-12-26, 7, 12:40, 6, 60, 1, 2, 2, 2